Stat405
    Advanced data manipulation


                             Hadley Wickham
Tuesday, 28 September 2010
               1. Baby names data
               2. Slicing and dicing revision
               3. Merging data
               4. Group-wise operations




Baby names
                   Top 1000 male and female baby
                   names in the US, from 1880 to
                   2008.
                   258,000 records (1000 * 2 * 129)
                   But only five variables: year,
                   name, soundex, sex and prop.

                                      CC BY http://www.flickr.com/photos/the_light_show/2586781132
Getting started
               library(plyr)
               library(ggplot2)

               options(stringsAsFactors = FALSE)
               # Can read compressed files
               bnames <- read.csv("baby-names2.csv.bz2")

               # Can read files from website
               births <- read.csv(
                 "http://had.co.nz/stat405/data/births.csv")

               # Unfortunately can't do both at the same time :(
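One possible workaround for the limitation noted above, sketched under assumptions: download the compressed file to a temporary location first, then read it from disk. The helper name `read_remote_csv` is hypothetical (it is not part of the course code), and the demonstration uses a local `file://` URL on a Unix-like system with made-up rows so that it runs without network access.

```r
# Hypothetical helper (an assumption, not from the slides): fetch a possibly
# compressed CSV to a temporary file, then let read.csv() decompress it.
read_remote_csv <- function(url, ...) {
  tmp <- tempfile(fileext = ".csv.bz2")
  on.exit(unlink(tmp))
  download.file(url, tmp, mode = "wb", quiet = TRUE)
  read.csv(tmp, ...)
}

# Self-contained demonstration: write a small bzip2-compressed CSV
# (made-up rows), then read it back through a file:// URL.
src <- tempfile(fileext = ".csv.bz2")
con <- bzfile(src, "w")
write.csv(data.frame(year = c(1880, 1881), births = c(100, 110)),
  con, row.names = FALSE)
close(con)

demo <- read_remote_csv(paste0("file://", normalizePath(src)))
demo
```

With a real remote file, the same helper would be called with the `http://` address instead of the local one.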



> head(bnames, 20)                         > tail(bnames, 20)
       year    name soundex       prop   sex          year     name soundex     prop sex
    1 1880     John    J500   0.081541   boy   257981 2008     Miya    M000 0.000130 girl
    2 1880 William     W450   0.080511   boy   257982 2008     Rory    R600 0.000130 girl
    3 1880    James    J520   0.050057   boy   257983 2008 Desirae     D260 0.000130 girl
    4 1880 Charles     C642   0.045167   boy   257984 2008   Kianna    K500 0.000130 girl
    5 1880 George      G620   0.043292   boy   257985 2008   Laurel    L640 0.000130 girl
    6 1880    Frank    F652   0.027380   boy   257986 2008   Neveah    N100 0.000130 girl
    7 1880 Joseph      J210   0.022229   boy   257987 2008   Amaris    A562 0.000129 girl
    8 1880 Thomas      T520   0.021401   boy   257988 2008 Hadassah    H320 0.000129 girl
    9 1880    Henry    H560   0.020641   boy   257989 2008    Dania    D500 0.000129 girl
    10 1880 Robert     R163   0.020404   boy   257990 2008   Hailie    H400 0.000129 girl
    11 1880 Edward     E363   0.019965   boy   257991 2008   Jamiya    J500 0.000129 girl
    12 1880   Harry    H600   0.018175   boy   257992 2008    Kathy    K300 0.000129 girl
    13 1880 Walter     W436   0.014822   boy   257993 2008   Laylah    L400 0.000129 girl
    14 1880 Arthur     A636   0.013504   boy   257994 2008     Riya    R000 0.000129 girl
    15 1880    Fred    F630   0.013251   boy   257995 2008     Diya    D000 0.000128 girl
    16 1880 Albert     A416   0.012609   boy   257996 2008 Carleigh    C642 0.000128 girl
    17 1880 Samuel     S540   0.008648   boy   257997 2008    Iyana    I500 0.000128 girl
    18 1880   David    D130   0.007339   boy   257998 2008   Kenley    K540 0.000127 girl
    19 1880   Louis    L200   0.006993   boy   257999 2008   Sloane    S450 0.000127 girl
    20 1880     Joe    J000   0.006174   boy   258000 2008 Elianna     E450 0.000127 girl

Your turn

                   Extract your name from the dataset. Plot
                   the trend over time.
                   What geom should you use? Do you
                   need any extra aesthetics?




     hadley <- subset(bnames, name == "Hadley")

     qplot(year, prop, data = hadley, colour = sex,
       geom = "line")
     # :(




Your turn

                   Use the soundex variable to extract all
                   names that sound like yours. Plot the
                   trend over time.
                   Do you have any difficulties? Think about
                   grouping.




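     gabi <- subset(bnames, soundex == "G164")
     # Points: one per name, sex and year
     qplot(year, prop, data = gabi)
     # A single line through every point: names and sexes get mixed up
     qplot(year, prop, data = gabi, geom = "line")

     # One panel per name, one line per sex
     qplot(year, prop, data = gabi, geom = "line",
       colour = sex) + facet_wrap(~ name)

     # One panel, one line per combination of sex and name
     qplot(year, prop, data = gabi, geom = "line",
       colour = sex, group = interaction(sex, name))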
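To see why the last call groups correctly, it may help to look at what `interaction()` returns. The sketch below uses a made-up toy data frame (the names and years are illustrative, not taken from `bnames`): colouring by sex alone would give two groups, while `interaction(sex, name)` gives one group per combination that actually occurs.

```r
# Toy data (made up for illustration): one name used for both sexes,
# plus a second girls' name with the same soundex.
gabi_toy <- data.frame(
  year = rep(c(1990, 2000), times = 3),
  name = rep(c("Gabriel", "Gabriel", "Gabriela"), each = 2),
  sex  = rep(c("boy", "girl", "girl"), each = 2),
  stringsAsFactors = FALSE
)

# Grouping by sex alone: both girls' names end up on the same line
by_sex <- factor(gabi_toy$sex)
nlevels(by_sex)

# Grouping by sex and name: one line per combination present in the data
by_both <- interaction(gabi_toy$sex, gabi_toy$name, drop = TRUE)
nlevels(by_both)
table(by_both)
```

Each of the three groups holds exactly one row per year, which is what a line geom needs to avoid the sawtooth pattern.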




Sawtooth appearance implies grouping is incorrect.
[Figure: prop plotted against year (1880 to 2000) as lines coloured by sex (boy, girl).]

Slicing and dicing

                             Function                  Package
                              subset                    base
                             summarise                  plyr
                             transform                  base
                              arrange                   plyr

                       They all have similar syntax. The first argument
                       is a data frame, and all other arguments are
                       interpreted in the context of that data frame.
                       Each returns a data frame.

                             color   value      color   value
                             blue     1         blue      1
                             black    2         blue      3
                             blue     3         blue      4
                             blue     4
                             black    5




                              subset(df, color == "blue")

                             color   value   color value double
                             blue     1      blue    1     2
                             black    2      black   2     4
                             blue     3      blue    3     6
                             blue     4      blue    4     8
                             black    5      black   5    10




                 transform(df, double = 2 * value)

                             color   value          double
                             blue     1                2
                             black    2                4
                             blue     3                6
                             blue     4                8
                             black    5               10




                 summarise(df, double = 2 * value)

                             color   value          total
                             blue     1               15
                             black    2
                             blue     3
                             blue     4
                             black    5




                 summarise(df, total = sum(value))
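The toy examples above can be run as a short sketch. It assumes the plyr package is installed; `df` is rebuilt here from the values shown on the slides. The point to notice is that `transform()` keeps the existing columns, while `summarise()` returns only the columns you ask for.

```r
library(plyr)

# The toy data frame from the slides
df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5,
  stringsAsFactors = FALSE
)

blue        <- subset(df, color == "blue")        # keeps matching rows
doubled     <- transform(df, double = 2 * value)  # adds a column
only_double <- summarise(df, double = 2 * value)  # returns only `double`
total       <- summarise(df, total = sum(value))  # one row, one column

names(doubled)
names(only_double)
total
```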

                              color   value        color   value
                              4       1             1      2
                              1       2             2      5
                              5       3             3      4
                              3       4             4      1
                              2       5             5      3




                                     arrange(df, color)

                              color   value        color   value
                              4       1             5      3
                              1       2             4      1
                              5       3             3      4
                              3       4             2      5
                              2       5             1      2




                                  arrange(df, desc(color))
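A small sketch of the two `arrange()` calls, assuming plyr is installed and using the numeric toy values from the slides. `desc()` is plyr's helper for sorting in descending order.

```r
library(plyr)

# Toy data from the slides: here color is numeric
df <- data.frame(color = c(4, 1, 5, 3, 2), value = 1:5)

ascending  <- arrange(df, color)        # rows ordered by increasing color
descending <- arrange(df, desc(color))  # rows ordered by decreasing color

ascending
descending
```

Whole rows move together, so each `value` stays attached to its `color`.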

Your turn

                   Calculate the total, largest and smallest
                   proportions.
                   Reorder the data frame containing your
                   name from highest to lowest popularity.




     summarise(bnames,
       total = sum(prop),
       largest = max(prop),
       smallest = min(prop))

     arrange(hadley, desc(prop))




Brainstorm

                   Thinking about the data, what are some
                   of the trends that you might want to
                   explore? What additional variables would
                   you need to create? What other data
                   sources might you want to use?
                   Pair up and brainstorm for 2 minutes.



                      External            Internal

                      Biblical names      First/last letter
                      Hurricanes          Length
                      Ethnicity           Vowels
                      Famous people       Rank
                                          Sounds-like

                      join                ddply
Merging data

Combining datasets
          Name   instrument        Name   band
          John   guitar            John   T
          Paul   bass              Paul   T
          George guitar       +    George T       =   ?
          Ringo  drums             Ringo  T
          Stuart bass              Brian  F
          Pete   drums




          x                               y
          Name instrument                 Name band       Name instrument band
           John   guitar                   John  T         John   guitar    T
           Paul   bass                     Paul  T         Paul   bass      T
          George guitar               +   George T    =   George guitar     T
          Ringo  drums                    Ringo  T        Ringo  drums      T
          Stuart  bass                     Brian F        Stuart  bass     NA
           Pete  drums                                     Pete  drums     NA




                                 join(x, y, type = "left")

          x                           y
          Name instrument             Name band       Name instrument band
           John   guitar               John  T         John   guitar   T
           Paul   bass                 Paul  T         Paul   bass     T
          George guitar           +   George T    =   George guitar    T
          Ringo  drums                Ringo  T        Ringo  drums     T
          Stuart  bass                 Brian F         Brian   NA      F
           Pete  drums




                             join(x, y, type = "right")

          x                           y
          Name instrument             Name band       Name instrument band
           John   guitar               John  T        John     guitar   T
           Paul   bass                 Paul  T         Paul    bass     T
          George guitar           +   George T    =   George   guitar   T
          Ringo  drums                Ringo  T        Ringo    drums    T
          Stuart  bass                 Brian F
           Pete  drums




                             join(x, y, type = "inner")

          x                               y
          Name instrument                 Name band       Name instrument band
           John   guitar                   John  T        John     guitar   T
           Paul   bass                     Paul  T         Paul    bass     T
          George guitar               +   George T    =   George   guitar   T
          Ringo  drums                    Ringo  T        Ringo    drums    T
          Stuart  bass                     Brian F        Stuart   bass     NA
           Pete  drums                                     Pete    drums    NA
                                                          Brian     NA      F



                                 join(x, y, type = "full")
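The four join types illustrated above can be reproduced with a short sketch. It assumes plyr is installed; `x` and `y` are rebuilt from the tables on the slides, with `band` stored as the character values "T" and "F" exactly as displayed.

```r
library(plyr)

x <- data.frame(
  Name = c("John", "Paul", "George", "Ringo", "Stuart", "Pete"),
  instrument = c("guitar", "bass", "guitar", "drums", "bass", "drums"),
  stringsAsFactors = FALSE
)
y <- data.frame(
  Name = c("John", "Paul", "George", "Ringo", "Brian"),
  band = c("T", "T", "T", "T", "F"),
  stringsAsFactors = FALSE
)

left  <- join(x, y, by = "Name", type = "left")   # all of x
right <- join(x, y, by = "Name", type = "right")  # all of y
inner <- join(x, y, by = "Name", type = "inner")  # only names in both
full  <- join(x, y, by = "Name", type = "full")   # every name

# Row counts show which names each type keeps
c(left = nrow(left), right = nrow(right),
  inner = nrow(inner), full = nrow(full))
```

Rows without a partner get `NA` in the columns that come from the other table, as with Stuart and Pete in the left join.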

                      Type       Action

                      "left"     Include all of x, and matching rows of y
                      "right"    Include all of y, and matching rows of x
                      "inner"    Include only rows in both x and y
                      "full"     Include all rows

Your turn

                   Convert from proportions to absolute
                   numbers by combining bnames with births,
                   and then performing the appropriate
                   calculation.




     # Add the births column, matching on year and sex
     bnames2 <- join(bnames, births,
       by = c("year", "sex"))
     tail(bnames2)

     # Proportion times births gives a fractional count
     bnames2 <- transform(bnames2, n = prop * births)
     tail(bnames2)

     # Round to whole numbers of babies
     bnames2 <- transform(bnames2,
       n = round(prop * births))
     tail(bnames2)
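The same steps on a tiny example make the arithmetic easy to check by hand. This is a sketch with made-up numbers: `bnames_toy` and `births_toy` are hypothetical stand-ins with the same column names as the real data, and plyr is assumed to be installed.

```r
library(plyr)

# Made-up proportions and birth totals, for illustration only
bnames_toy <- data.frame(
  year = c(1880, 1880, 1881),
  name = c("John", "Mary", "John"),
  sex  = c("boy", "girl", "boy"),
  prop = c(0.08, 0.07, 0.09),
  stringsAsFactors = FALSE
)
births_toy <- data.frame(
  year   = c(1880, 1880, 1881, 1881),
  sex    = c("boy", "girl", "boy", "girl"),
  births = c(100000, 90000, 110000, 95000),
  stringsAsFactors = FALSE
)

# Each name row picks up the births total for its year and sex
joined <- join(bnames_toy, births_toy, by = c("year", "sex"))

# Absolute numbers: proportion times births, rounded to whole babies
counts <- transform(joined, n = round(prop * births))
counts
```

Joining on both `year` and `sex` matters: joining on `year` alone would pair each name with the totals for both sexes.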




[Figure: births plotted against year (1880 to 2000), coloured by sex (boy, girl), with annotations "1936: first issued" and "1986: needed for child tax deduction".]

Group-wise operations

Number of people

                   How do we compute the number of
                   people with each name over all years? It’s
                   pretty easy if you have a single name.
                   How would you do it?




     hadley <- subset(bnames2, name == "Hadley")
     sum(hadley$n)

     # Or
     summarise(hadley, n = sum(n))

     # But how could we do this for every name?




     # Split
     pieces <- split(bnames2, list(bnames2$name))

     # Apply
     results <- vector("list", length(pieces))
     for(i in seq_along(pieces)) {
       piece <- pieces[[i]]
       results[[i]] <- summarise(piece, n = sum(n))
     }

     # Combine
     result <- do.call("rbind", results)


     # Or equivalently

     counts <- ddply(bnames2, "name", summarise,
       n = sum(n))




     # Or equivalently

     counts <- ddply(bnames2, "name", summarise,
       n = sum(n))

     # bnames2:    input data
     # "name":     way to split up input
     # summarise:  function to apply to each piece
     # n = sum(n): 2nd argument to summarise()

                       Split          Apply        Combine

       x   y           x   y
       a   2           a   2            3
       a   4           a   4                        x   y
       b   0           x   y                        a   3
       b   5           b   0            2.5         b   2.5
       c   5           b   5                        c   7.5
       c  10           x   y
                       c   5            7.5
                       c  10
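The split, apply and combine steps in the diagram can be written out as a sketch. The applied values on the slides (3, 2.5, 7.5) are consistent with taking the mean of `y` within each group, so that is assumed here; plyr is assumed to be installed.

```r
library(plyr)

# The toy data from the diagram
df <- data.frame(
  x = c("a", "a", "b", "b", "c", "c"),
  y = c(2, 4, 0, 5, 5, 10),
  stringsAsFactors = FALSE
)

# Split: one data frame per value of x
pieces <- split(df, df$x)

# Apply: reduce each piece to a single row
results <- lapply(pieces, function(piece) {
  data.frame(x = piece$x[1], y = mean(piece$y), stringsAsFactors = FALSE)
})

# Combine: stack the rows back into one data frame
by_hand <- do.call("rbind", results)
rownames(by_hand) <- NULL

# The same three steps in a single ddply() call
with_ddply <- ddply(df, "x", summarise, y = mean(y))

by_hand
with_ddply
```

Both approaches give one row per group; `ddply()` also adds the grouping column to the result for you.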
Your turn

                   Repeat the same operation, but use
                   soundex instead of name. What is the
                   most common sound? What name does
                   it correspond to?




     scounts <- ddply(bnames2, "soundex", summarise,
       n = sum(n))
     scounts <- arrange(scounts, desc(n))

     # Combine with names
     # When there are multiple possible matches,
     # match = "first" makes join pick the first
     scounts <- join(
       scounts, bnames2[, c("soundex", "name")],
       by = "soundex", match = "first")
     head(scounts, 100)

     subset(bnames, soundex == "L600")
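How `join()` treats multiple matches is worth checking on a small case. The sketch below uses made-up counts and a hypothetical lookup table; in current versions of plyr the `match` argument of `join()` defaults to `"all"`, and `match = "first"` keeps only the first matching row of the second table.

```r
library(plyr)

# Made-up totals per soundex code, for illustration only
scounts_toy <- data.frame(
  soundex = c("L600", "J500"),
  n = c(500, 300),
  stringsAsFactors = FALSE
)
# Hypothetical lookup: two names per soundex code
lookup <- data.frame(
  soundex = c("L600", "L600", "J500", "J500"),
  name = c("Larry", "Laura", "John", "Jan"),
  stringsAsFactors = FALSE
)

# Every match is kept: one row per (soundex, name) pair
all_matches <- join(scounts_toy, lookup, by = "soundex")

# Only the first match is kept: one row per soundex
first_match <- join(scounts_toy, lookup, by = "soundex", match = "first")

all_matches
first_match
```

With `match = "first"` the name shown is just whichever row comes first in the lookup table, so it is a label for the sound rather than its most popular spelling.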


     # Alternative approach that you'll learn more
     # about on Thursday

     library(stringr)
     scounts <- ddply(bnames2, "soundex", summarise,
       n = sum(n),
       names = str_c(sort(unique(name)), collapse = ","))
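To see what the `names` column ends up containing, here is a sketch on made-up data. It swaps in base R's `paste(..., collapse = ",")` for `str_c()` so that only plyr is needed; the toy counts are illustrative.

```r
library(plyr)

# Made-up rows: the same name can appear in several years
toy <- data.frame(
  soundex = c("L600", "L600", "L600", "J500"),
  name = c("Laura", "Larry", "Laura", "John"),
  n = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

scounts_toy <- ddply(toy, "soundex", summarise,
  n = sum(n),
  # unique() drops repeats, sort() orders them, paste() glues them together
  names = paste(sort(unique(name)), collapse = ","))
scounts_toy <- arrange(scounts_toy, desc(n))
scounts_toy
```

Each soundex code gets a single row that lists every distinct spelling, rather than one arbitrarily chosen name.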
     scounts <- arrange(scounts, desc(n))



